Introductory Machine Learning Pipeline (Classification)¶

Author: Adrien Carrel

Classification tasks are tasks where the model's goal is to predict a categorical value: we assign a class (or multiple classes) to a given set of feature values.

Objective: Predict an increase in the vaccination coverage of a country in Latam based on previous data:

  • Media coverage of the words "COVID-19 Vaccination"
  • Increase/decrease of COVID cases/deaths
  • Increase of vaccination

Application: Generate early warning systems?

Import, load data, preprocessing, analysis & split¶

Main imports

In [ ]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# change renderer to correctly export the plotly figures in the notebook
import plotly.io as pio
pio.renderers.default = "notebook"

Load daily vaccination data in Latam

In [ ]:
vac = pd.read_csv("daily_covid_vaccinations_latam.csv")
vac.head(5)
Out[ ]:
location 2020-12-24 2020-12-25 2020-12-26 2020-12-27 2020-12-28 2020-12-29 2020-12-30 2020-12-31 2021-01-01 ... 2022-05-17 2022-05-18 2022-05-19 2022-05-20 2022-05-21 2022-05-22 2022-05-23 2022-05-24 2022-05-25 2022-05-26
0 Antigua and Barbuda 0.0 0.0 0.0 0.0 0.0 0.00 0.00 0.0 0.0 ... 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.0
1 Argentina 0.0 0.0 0.0 0.0 0.0 0.04 0.09 0.1 0.1 ... 90.07 90.07 90.08 90.08 90.09 90.09 90.09 90.10 90.10 0.0
2 Bahamas 0.0 0.0 0.0 0.0 0.0 0.00 0.00 0.0 0.0 ... 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.0
3 Barbados 0.0 0.0 0.0 0.0 0.0 0.00 0.00 0.0 0.0 ... 56.35 56.37 56.38 0.00 56.39 0.00 56.40 56.41 56.42 0.0
4 Belize 0.0 0.0 0.0 0.0 0.0 0.00 0.00 0.0 0.0 ... 0.00 0.00 0.00 58.65 0.00 0.00 0.00 0.00 0.00 0.0

5 rows × 520 columns

Media cloud data in Latam. Keywords: "COVID-19 Vaccination"

In [ ]:
vac_media = pd.read_csv("covid_vaccination_attention_latam.csv")
vac_media.head(5)
Out[ ]:
date count_Antigua and Barbuda total_count_Antigua and Barbuda ratio_Antigua and Barbuda count_Argentina total_count_Argentina ratio_Argentina count_Bahamas total_count_Bahamas ratio_Bahamas ... ratio_Suriname count_Trinidad and Tobago total_count_Trinidad and Tobago ratio_Trinidad and Tobago count_Uruguay total_count_Uruguay ratio_Uruguay count_Venezuela total_count_Venezuela ratio_Venezuela
0 1/1/2020 0 3 0.0 2 2699 0.000741 0 12 0.0 ... 0.0 0 25 0.0 0 158 0.0 0 464 0.0
1 1/2/2020 0 6 0.0 4 6319 0.000633 0 51 0.0 ... 0.0 0 28 0.0 0 314 0.0 0 678 0.0
2 1/3/2020 0 10 0.0 5 7209 0.000694 0 60 0.0 ... 0.0 0 29 0.0 0 363 0.0 0 897 0.0
3 1/4/2020 0 7 0.0 0 3805 0.000000 0 39 0.0 ... 0.0 0 14 0.0 0 192 0.0 0 659 0.0
4 1/5/2020 0 8 0.0 1 3771 0.000265 0 34 0.0 ... 0.0 0 27 0.0 0 203 0.0 0 773 0.0

5 rows × 97 columns

Load COVID-19 cases and deaths in Latam

In [ ]:
cases = pd.read_csv("daily_covid_cases_latam.csv")
deaths = pd.read_csv("daily_covid_deaths_latam.csv")

Transpose the dataframe

In [ ]:
country_names = vac.T.iloc[0].tolist()  # list of countries
df = vac.T  # transpose dataframe
df.columns = country_names
df = df[1:]  # drop the first row that contains the country names
df.index = pd.to_datetime(df.index)
df = df.astype(float)  # force float type
In [ ]:
# describe our dataset
df.describe()
Out[ ]:
Antigua and Barbuda Argentina Bahamas Barbados Belize Bolivia Brazil Chile Colombia Costa Rica ... Nicaragua Panama Paraguay Peru Saint Lucia Saint Vincent and the Grenadines Suriname Trinidad and Tobago Uruguay Venezuela
count 519.000000 519.000000 519.000000 519.000000 519.000000 519.000000 519.000000 519.000000 519.000000 519.000000 ... 519.000000 519.000000 519.000000 519.000000 519.000000 519.000000 519.000000 519.000000 519.000000 519.000000
mean 7.631098 52.882042 2.240116 33.958593 7.376455 29.802620 46.839634 65.312543 20.688227 6.754451 ... 3.763121 6.809152 7.189865 40.600906 7.717264 4.802736 14.083083 24.887360 57.606108 1.245877
std 17.552802 34.505647 8.261362 20.815181 16.204873 22.657337 33.225499 30.656022 29.035664 20.816658 ... 16.247307 20.185087 14.041347 34.154956 11.804160 10.412137 17.913978 22.577068 31.828416 8.372928
min 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
25% 0.000000 16.120000 0.000000 21.085000 0.000000 4.750000 10.840000 42.740000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 3.085000 0.000000 0.000000 0.000000 0.000000 32.955000 0.000000
50% 0.000000 63.680000 0.000000 35.080000 0.000000 34.310000 57.800000 75.370000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 34.750000 0.000000 0.000000 0.000000 23.890000 75.960000 0.000000
75% 0.000000 86.935000 0.000000 54.050000 0.000000 53.880000 77.670000 91.050000 41.115000 0.000000 ... 0.000000 0.000000 5.080000 75.285000 16.955000 0.000000 32.875000 49.330000 79.295000 0.000000
max 64.740000 90.100000 41.940000 56.420000 58.650000 60.820000 85.840000 93.460000 82.350000 86.350000 ... 85.570000 79.270000 54.010000 87.830000 32.080000 32.810000 45.260000 53.360000 85.800000 77.190000

8 rows × 32 columns

Datasaurus Dozen: Always plot your data! Here are examples of sets of points that share the same x and y means, variances, and correlation, yet look completely different.

datasaurus

Source of the image

Remark: If you work in a high-dimensional space, use methods like t-SNE to project your data into a lower-dimensional space before plotting it
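As an illustration, here is a minimal t-SNE projection sketch (the high-dimensional data is randomly generated purely for demonstration):

```python
import numpy as np
from sklearn.manifold import TSNE

# Hypothetical high-dimensional data: 100 points in 20 dimensions
rng = np.random.default_rng(42)
high_dim = rng.normal(size=(100, 20))

# Project to 2D so the data can be drawn with a simple scatter plot
# (perplexity must be smaller than the number of samples)
embedding = TSNE(n_components=2, perplexity=30, random_state=42).fit_transform(high_dim)
print(embedding.shape)
```

The resulting 2D embedding can then be passed to `plt.scatter` like any other set of points.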

In [ ]:
# plot vaccination data for Brazil
plt.plot(df["Brazil"])
plt.title("Always plot your data!")
plt.xlabel("Time")
plt.ylabel("Cumulated vaccination coverage (%)")
plt.xticks(rotation=45)
plt.show()

=> Apply cumulative max on your data

In [ ]:
df = df.cummax()  # apply cumulative max
plt.plot(df["Brazil"])
plt.title("Always plot your data!")
plt.xlabel("Time")
plt.ylabel("Cumulated vaccination coverage (%)")
plt.xticks(rotation=45)
plt.show()

Process vaccination data: from raw data to increase/constant on a daily basis (binary data)

In [ ]:
df = (df > df.shift(1)).astype(int)  # transform raw data into binary data
df.head(5)
Out[ ]:
Antigua and Barbuda Argentina Bahamas Barbados Belize Bolivia Brazil Chile Colombia Costa Rica ... Nicaragua Panama Paraguay Peru Saint Lucia Saint Vincent and the Grenadines Suriname Trinidad and Tobago Uruguay Venezuela
2020-12-24 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
2020-12-25 0 0 0 0 0 0 0 1 0 0 ... 0 0 0 0 0 0 0 0 0 0
2020-12-26 0 0 0 0 0 0 0 1 0 0 ... 0 0 0 0 0 0 0 0 0 0
2020-12-27 0 0 0 0 0 0 0 1 0 0 ... 0 0 0 0 0 0 0 0 0 0
2020-12-28 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0

5 rows × 32 columns

Transpose deaths and cases

In [ ]:
# cases
country_names = cases["location"].tolist()
cases = cases.T
cases.columns = country_names
cases = cases[1:]
cases.index = pd.to_datetime(cases.index)

# deaths
country_names = deaths["location"].tolist()
deaths = deaths.T
deaths.columns = country_names
deaths = deaths[1:]
deaths.index = pd.to_datetime(deaths.index)

Plot cases, deaths, vaccination media coverage for Brazil as an example

In [ ]:
fig, (ax1, ax2, ax3) = plt.subplots(1, 3, figsize=(20, 9))
ax1.plot(cases["Brazil"])
ax2.plot(deaths["Brazil"])
ax3.plot(pd.to_datetime(vac_media["date"]), vac_media["ratio_Brazil"])
plt.xlabel("Time")
ax1.tick_params(labelrotation=45)
ax2.tick_params(labelrotation=45)
ax3.tick_params(labelrotation=45)
ax1.set_title("Daily COVID cases in Brazil")
ax2.set_title("Daily COVID deaths in Brazil")
ax3.set_title("Daily COVID-19 Vaccination media coverage in Brazil")
plt.show()

Apply the same processing to cases and deaths (transform each continuous timeseries into a binary one: increase vs no increase)

In [ ]:
# cases
cases = (cases > cases.shift(1)).astype(int)

# deaths
deaths = (deaths > deaths.shift(1)).astype(int)

Merge datasets

We crop the other datasets to keep only data from 2020-12-24 to 2022-05-26 (our vaccination dataset).

In [ ]:
first_date_df = pd.to_datetime(df.index[0])  # first date in vaccination data
last_date_df = pd.to_datetime(df.index[-1])  # last date in vaccination data
country_names = df.columns
# Media
for country in country_names:
    col = vac_media[[f"ratio_{country}"]]
    col.index = pd.to_datetime(vac_media.date)
    col = col[(col.index >= first_date_df) & (col.index <= last_date_df)]
    df[f"ratio_{country}"] = col[f"ratio_{country}"].to_numpy()
# Cases
for country in country_names:
    col = cases[[country]]
    col.index = pd.to_datetime(cases.index)
    col = col[(col.index >= first_date_df) & (col.index <= last_date_df)]
    df[f"cases_{country}"] = col[country].to_numpy()
# Deaths
for country in country_names:
    col = deaths[[country]]
    col.index = pd.to_datetime(deaths.index)
    col = col[(col.index >= first_date_df) & (col.index <= last_date_df)]
    df[f"deaths_{country}"] = col[country].to_numpy()
df.index = pd.to_datetime(df.index)  # datetime format to compare dates
df.head(5)
Out[ ]:
Antigua and Barbuda Argentina Bahamas Barbados Belize Bolivia Brazil Chile Colombia Costa Rica ... deaths_Nicaragua deaths_Panama deaths_Paraguay deaths_Peru deaths_Saint Lucia deaths_Saint Vincent and the Grenadines deaths_Suriname deaths_Trinidad and Tobago deaths_Uruguay deaths_Venezuela
2020-12-24 0 0 0 0 0 0 0 0 0 0 ... 0 1 0 1 0 0 0 0 0 0
2020-12-25 0 0 0 0 0 0 0 1 0 0 ... 0 0 0 0 0 0 1 0 1 0
2020-12-26 0 0 0 0 0 0 0 1 0 0 ... 0 1 1 1 0 0 0 0 0 0
2020-12-27 0 0 0 0 0 0 0 1 0 0 ... 0 0 0 1 0 0 0 0 1 0
2020-12-28 0 0 0 0 0 0 0 0 0 0 ... 0 1 0 0 0 0 1 0 0 1

5 rows × 128 columns

It doesn't make any sense to predict a variation in the number of vaccinations for a given country if nobody has been vaccinated yet in this country.

=> Extract the date of first vaccination for each country

In [ ]:
starting_dates = {}  # start when the first vaccination occurs in each country
for country in country_names:
    starting_dates[country] = pd.to_datetime(df[df[country] == 1].index[0])

Plot autocorrelation of variation in COVID-19 Vaccinations in Colombia

In [ ]:
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf

plot_acf(x=df["Colombia"], lags=20)
plt.show()

Partial autocorrelation

In [ ]:
plot_pacf(x=df["Colombia"], lags=20)
plt.show()

We notice above that the significance is low for lags greater than 7.

Let $k\in\mathbb{N}^{*}$ be a lag to extract multiple data in the past at each timestamp ($k=7$ means that we take present data and data from the 6 previous days to make a prediction for the next day).

Let $(V_{t})_{t\in\mathbb{N}}$, $(D_{t})_{t\in\mathbb{N}}$, $(C_{t})_{t\in\mathbb{N}}$, $(M_{t})_{t\in\mathbb{N}}$ be the timeseries of vaccination variations, COVID deaths variations, COVID cases variations, and media coverage of vaccination in the country.

We want to fit a model that can predict, for all $t\geq k-1$, the value of $V_{t+1}$, which represents an increase (1) or no change (0) in the number of COVID vaccinations between days $t$ and $t+1$.

One can formulate the problem as follows: find $f: \left ( \{0,1\}^{3}\times [0,1]\right )^{k}\rightarrow \{0,1\}$ such that $\forall t\geq k-1$, $V_{t+1}\approx f(V_{t}, D_{t}, C_{t}, M_{t}, \ldots, V_{t-k+1}, D_{t-k+1}, C_{t-k+1}, M_{t-k+1})=\widehat{V_{t+1}}$

The quality of the model will be evaluated using different metrics between our predictions $(\widehat{V_{t}})_{t\in\mathbb{N}}$ and the true values $(V_{t})_{t\in\mathbb{N}}$.

Let's try $k=7$ first (one week of data)

In [ ]:
k = 7  # last 7 days
X = []  # resulting dataset
features_name = ["increase_vac", "media_ratio", "increase_cases", "increase_deaths"]  # name of the features
columns_name = features_name[:]  # columns_name = full list of features, lag features included
for j in range(1, k):
    for feat in features_name:
        columns_name.append(f"{feat}_l{j}")
columns_name.append("country_name")
columns_name.append("date")
columns_name.append("label")

# add lag features + country name (for interrogation of bias later)
for country in country_names:
    features = [country, f"ratio_{country}", f"cases_{country}", f"deaths_{country}"]
    sub_df = df[features].copy()  # copy to avoid SettingWithCopyWarning when adding columns
    sub_df = sub_df[sub_df.index >= starting_dates[country]]
    for j in range(1, k):
        for feat in features:
            sub_df[f"{feat}_l{j}"] = sub_df[feat].shift(j)

    sub_df["country_name"] = country  # add country name
    sub_df["date"] = pd.to_datetime(sub_df.index)  # add date (to sort later)
    sub_df["label"] = sub_df[country].shift(-1)  # add label (shift other direction)
    sub_df.dropna(inplace=True)  # drop nan values due to the shift(s)
    X.extend(list(sub_df.to_numpy()))

X = pd.DataFrame(X, columns=columns_name)
X.sort_values(by="date", inplace=True)  # sort by date !! (to avoid time related bias when evaluating test set)
y = X["label"]  # y is the label
X.drop(columns=["label", "date"], inplace=True)  # drop label and date
X.head(5)
Out[ ]:
increase_vac media_ratio increase_cases increase_deaths increase_vac_l1 media_ratio_l1 increase_cases_l1 increase_deaths_l1 increase_vac_l2 media_ratio_l2 ... increase_deaths_l4 increase_vac_l5 media_ratio_l5 increase_cases_l5 increase_deaths_l5 increase_vac_l6 media_ratio_l6 increase_cases_l6 increase_deaths_l6 country_name
3231 0 0.224383 0 1 0.0 0.244272 1.0 0.0 0.0 0.239538 ... 0.0 1.0 0.258086 1.0 0.0 1.0 0.333841 0.0 0.0 Chile
3232 0 0.270036 1 0 0.0 0.224383 0.0 1.0 0.0 0.244272 ... 0.0 1.0 0.285285 0.0 0.0 1.0 0.258086 1.0 0.0 Chile
9439 0 0.309479 0 0 0.0 0.346581 0.0 0.0 0.0 0.335810 ... 1.0 0.0 0.383148 0.0 1.0 1.0 0.358477 1.0 1.0 Mexico
3233 0 0.284797 0 1 0.0 0.270036 1.0 0.0 0.0 0.224383 ... 1.0 0.0 0.259018 1.0 0.0 1.0 0.285285 0.0 0.0 Chile
3234 0 0.241434 0 0 0.0 0.284797 0.0 1.0 0.0 0.270036 ... 0.0 0.0 0.239538 1.0 1.0 0.0 0.259018 1.0 0.0 Chile

5 rows × 29 columns

Size of the dataset

In [ ]:
print(f"Size of the dataset (nb_rows, nb_columns): {X.drop(columns=['country_name']).shape}")
Size of the dataset (nb_rows, nb_columns): (14389, 28)

Check missingness in our data (no missing data at all)

In [ ]:
print(f"Maximum amount of missing data among all features: {X.isnull().sum().max()}")
Maximum amount of missing data among all features: 0

Split to create a test set (20% of the data, the most recent ones) to evaluate a posteriori the performance of a model.

Also remove the country name variable from X (it will be used later to evaluate biases)

In [ ]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X.drop(columns=["country_name"]), y, test_size=0.20, shuffle=False)

Models¶

Objective:

  • Try the following models: Random Forest, Logistic Regression, AdaBoost

  • Use a Time series cross-validation

  • Compare AUC and other metrics

  • Hyperparameter tuning

  • Performance on unseen data: evaluation of the model on the test set

  • Understand our model: sensitivity analysis

  • Interrogation for bias: does the created model perform equally well across the different countries in Latam?

Presentation of the models¶

Let $x=(x_{1}, \ldots, x_{n})\in\mathbb{R}^{n}$ be our features for a particular element of our dataset.

Logistic Regression¶

Find $\theta\in\mathbb{R}^{n}$ such that $p(label|data)=\sigma (\theta^{T}\cdot x)=\frac{1}{1+e^{-\theta^{T}\cdot x}}$.

If $p\geq 0.5$, predict class 1, else 0.
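The prediction rule above can be sketched in a few lines (the weights $\theta$ and the feature vector $x$ here are made up for illustration):

```python
import numpy as np

def sigmoid(z):
    # logistic function: sigma(z) = 1 / (1 + exp(-z))
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical fitted weights theta and one feature vector x
theta = np.array([0.5, -1.2, 0.3])
x = np.array([1.0, 0.4, 2.0])

p = sigmoid(theta @ x)      # p(label = 1 | x)
prediction = int(p >= 0.5)  # decision rule: class 1 if p >= 0.5, else 0
print(p, prediction)
```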

lr

Source of the image

Random Forest¶

Decision Tree:

Measure of impurity: Gini index (but others exist, e.g. entropy)

At node $t$, $Gini(t)=1-\sum_{j=1}^{J}(p(j|t))^{2}$ where $J$ is the number of classes (binary classification, $J=2$) and $p(j|t)$ is the relative frequency of class $j$ at node $t$.

When node $t$ is split on attribute $A$ into $k$ partitions (children), the quality of split is defined by: $Gini_{A}=\sum_{i=1}^{k}\frac{n_{i}}{n}Gini(i)$ where $n_{i}$ is the number of records at child $i$ and $n$ is the number of records at node $t$.

Training => split nodes $t$ by minimizing the Gini index, until a stopping criterion is met (all records share the same attribute values or belong to the same class).
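The two formulas above can be checked on a toy node (the class counts are made up):

```python
def gini(class_counts):
    # Gini(t) = 1 - sum_j p(j|t)^2, from the relative class frequencies at node t
    n = sum(class_counts)
    return 1.0 - sum((c / n) ** 2 for c in class_counts)

def gini_split(children_counts):
    # Gini_A = sum_i (n_i / n) * Gini(i), weighted over the k child partitions
    n = sum(sum(child) for child in children_counts)
    return sum(sum(child) / n * gini(child) for child in children_counts)

print(gini([10, 10]))                  # balanced binary node: maximal impurity 0.5
print(gini_split([[10, 0], [0, 10]]))  # perfect split: pure children, quality 0.0
```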

decision_tree

Source of the image

Random Forest

Ensemble learning combines multiple algorithms to make better predictions, for example by averaging them.

Random Forest is a bagging model: the individual models make their predictions in parallel. This slightly increases the bias but decreases the variance.

Training => randomly select a subset of $m$ features and train a decision tree on a fraction of the training set (bootstrap technique). Repeat the process to obtain $N$ decision trees.
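A minimal sketch of the two sources of randomness described above (the sizes are made up):

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_features, m = 8, 5, 2

# Bootstrap: draw n_samples row indices with replacement for one tree
row_idx = rng.integers(0, n_samples, size=n_samples)
# Feature bagging: pick a random subset of m features to consider at a split
feat_idx = rng.choice(n_features, size=m, replace=False)

# Duplicated rows imply other rows are left out ("out-of-bag" samples)
print(row_idx, feat_idx)
```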

randomforest

Source of the image

AdaBoost¶

AdaBoost is on the contrary a boosting model. This ensemble learning method reduces bias but increases variance.

Models are trained in cascade: model $i$ is used to train model $i+1$. At the end, many weak learners are aggregated.

Dataset $D=((x_{i}, t_{i}))_{i\in\llbracket 1, n\rrbracket}$, start with a vector of equal weights $\forall i\in\llbracket 1, n\rrbracket, w_{i}^{(0)}=\frac{1}{n}$.

For all $t\in\mathbb{N}$, $y_{t}$ is the $t$-th model.

Train $y_{0}$ on the entire dataset with equal weights.

Let $J_{t}=\sum_{i=1}^{n}w_{i}^{(t)}\mathbb{1}(y_{t}(x_{i})\neq t_{i})$ be the cost function of classifier $t$.

=> Training

  • Calculate error rate of model $t$: $\gamma_{t}=\frac{J_{t}}{\sum_{i=1}^{n}w_{i}^{(t)}}$

  • Quality coefficient of classifier $t$ is: $\alpha_{t}=\ln\left ( \frac{1-\gamma_{t}}{\gamma_{t}} \right )$

  • Update weights: $w_{i}^{(t+1)}=w_{i}^{(t)}e^{\alpha_{t}\mathbb{1}(y_{t}(x_{i})\neq t_{i})}$

=> Idea: insist on the "hard cases", misclassified ones.
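One training round of the weight update above can be sketched as follows (the toy predictions and targets are made up):

```python
import numpy as np

def adaboost_round(weights, predictions, targets):
    # one AdaBoost iteration, following the formulas above
    miss = (predictions != targets).astype(float)  # indicator 1(y_t(x_i) != t_i)
    J = np.sum(weights * miss)                     # weighted cost J_t
    gamma = J / np.sum(weights)                    # error rate gamma_t
    alpha = np.log((1.0 - gamma) / gamma)          # quality coefficient alpha_t
    new_weights = weights * np.exp(alpha * miss)   # boost misclassified samples
    return new_weights, alpha

# Toy round: 4 samples with equal initial weights, the last one misclassified
w0 = np.full(4, 0.25)
w1, alpha = adaboost_round(w0, np.array([0, 1, 1, 0]), np.array([0, 1, 1, 1]))
print(w1, alpha)  # the misclassified sample's weight grows
```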

Final model:

adaboost

Source of the image: CentraleSupélec, 2EL1730, ML Lecture 05.

Cross-validation for Time series¶

Fit the same model multiple times on different splits of the dataset. Idea: make sure that we don't overfit a particular fold and that our model is truly learning patterns.

KFold cross-validation

CV_kfold

Source of the image

As we are working with timeseries, a timeseries cross-validation is mandatory to avoid mixing future and past data between the training and validation sets. Otherwise, the risk of overfitting is huge, and the model is more prone to dataset shift.

Time series cross-validation

cv

Source of the image
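For example, scikit-learn's TimeSeriesSplit produces folds where every validation index comes strictly after every training index (10 dummy samples here, for illustration):

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X_demo = np.arange(10).reshape(-1, 1)  # 10 chronologically ordered samples
splits = list(TimeSeriesSplit(n_splits=3).split(X_demo))
for train_idx, test_idx in splits:
    # the training window only ever grows into the past, never the future
    print(train_idx, test_idx)
```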

Baseline¶

Calculate the proportion of each label. If 50% of the labels are 1 and the other half are 0, an accuracy of 0.8 is great. However, if 80% of the labels are equal to 1, a "model" constantly predicting 1 will also reach an accuracy of 80%.

In [ ]:
print(f"Proportion of class 1 in the training set: {round(100 * y_train.sum() / len(y_train), 2)}")
print(f"Proportion of class 1 in the test set: {round(100 * y_test.sum() / len(y_test), 2)}")
Proportion of class 1 in the training set: 53.33
Proportion of class 1 in the test set: 32.8

We notice a drop in proportion between our train set and our test set.

Training¶

Let's use the Vulpes 🦊 (zorro) Python package (still in pre-alpha) to train many models, with great flexibility in the choice of preprocessing pipeline, cross-validation technique, metrics to compute, etc.

Link to the Vulpes package

In [ ]:
from vulpes.automl import Classifiers

classifiers = Classifiers(preprocessing=None, cv="timeseries")
models_dict = dict(classifiers.models_to_try)
models_to_try = [(m, models_dict[m]) for m in ["RandomForestClassifier", "LogisticRegression", "AdaBoostClassifier"]]
classifiers.models_to_try = models_to_try
classifiers.fit(X_train, y_train)
  0%|          | 0/3 [00:00<?, ?it/s]
RandomForestClassifier
 33%|███▎      | 1/3 [00:17<00:34, 17.22s/it]
LogisticRegression
 67%|██████▋   | 2/3 [00:17<00:07,  7.54s/it]
AdaBoostClassifier
100%|██████████| 3/3 [00:20<00:00,  6.86s/it]
Out[ ]:
Balanced Accuracy Accuracy Precision Recall F1 Score AUROC AUPRC Micro avg Precision Running time
Model
RandomForestClassifier 0.844237 0.843379 0.842033 0.844237 0.842232 0.907676 0.913777 0.912353 17.209869
LogisticRegression 0.835917 0.837435 0.835639 0.835917 0.835682 0.908409 0.919295 0.919390 0.765026
AdaBoostClassifier 0.834616 0.836392 0.834762 0.834616 0.834522 0.904284 0.913905 0.914010 2.601188

All models perform well.

Remark: we should normally also look at the standard deviation of the metrics to see whether some folds are over- or underperforming.
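A sketch of how that check could look with cross_validate (synthetic data stands in for X_train/y_train here, so the numbers are illustrative only):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import TimeSeriesSplit, cross_validate

# Synthetic stand-in data; in the notebook this would be X_train, y_train
X_demo, y_demo = make_classification(n_samples=200, random_state=42)
scores = cross_validate(
    RandomForestClassifier(n_estimators=10, random_state=42),
    X_demo, y_demo,
    cv=TimeSeriesSplit(n_splits=5),
    scoring="balanced_accuracy",
)["test_score"]
# a large std means some folds over- or underperform
print(scores.mean(), scores.std())
```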

Let's train on the entire training dataset now.

In [ ]:
classifiers.use_cross_validation = False
classifiers.fit(X_train, y_train)
  0%|          | 0/3 [00:00<?, ?it/s]
RandomForestClassifier
 33%|███▎      | 1/3 [00:01<00:03,  1.73s/it]
LogisticRegression
 67%|██████▋   | 2/3 [00:02<00:01,  1.32s/it]
AdaBoostClassifier
100%|██████████| 3/3 [00:03<00:00,  1.31s/it]
Out[ ]:
Balanced Accuracy Accuracy Precision Recall F1 Score AUROC AUPRC Micro avg Precision Running time
Model
RandomForestClassifier 0.829551 0.831090 0.832847 0.829551 0.830208 0.891767 0.889326 0.887335 1.728408
LogisticRegression 0.821478 0.822406 0.822758 0.821478 0.821855 0.894211 0.889794 0.889930 1.034183
AdaBoostClassifier 0.819542 0.820669 0.821343 0.819542 0.820004 0.889453 0.883121 0.883355 1.145862

Hyperparameter tuning¶

In [ ]:
# Import
import optuna
from sklearn.model_selection import (
    cross_validate,
    TimeSeriesSplit)
from sklearn.metrics import balanced_accuracy_score, make_scorer

# Extract our Random Forest pipeline (preprocessing + model, here no preprocessing)
pipe = classifiers.get_fitted_models()["RandomForestClassifier"]

# Cross validation and scoring method
cv = TimeSeriesSplit(n_splits=5)
scoring = {"balanced_accuracy": make_scorer(balanced_accuracy_score, greater_is_better=True)}

# Objective function: how we define which model is the best, how we tune hyperparameters
def objective(trial):
    # test values of max_depth and n_estimators
    randomforestclassifier__max_depth = trial.suggest_int("randomforestclassifier__max_depth", 2, 5)
    randomforestclassifier__n_estimators = trial.suggest_int("randomforestclassifier__n_estimators", 50, 500)
    # update model parameter
    model_params = {"randomforestclassifier__n_estimators": randomforestclassifier__n_estimators, "randomforestclassifier__max_depth": randomforestclassifier__max_depth}
    pipe.set_params(**model_params)
    # cross validation on train set
    cv_model = cross_validate(
        pipe,
        X_train,
        y_train,
        cv=cv,
        return_estimator=True,
        n_jobs=-1,
        fit_params={},
        scoring=scoring,
    )
    # return mean balanced accuracy on the different validation sets
    value = np.nanmean(cv_model["test_balanced_accuracy"])
    return value
In [ ]:
# Create our study

n_trials = 20  # number of steps to tune hyperparameters
# Create study
# direction="maximize": we want to maximize balanced accuracy
# pruner (optional): early stopping strategy
study = optuna.create_study(direction="maximize", study_name="Tuning",
                            pruner=optuna.pruners.MedianPruner(n_warmup_steps=10))
study.optimize(objective, n_trials=n_trials)

study_result = study.trials_dataframe()  # store the results in a dataframe
study_result.head(5)
[I 2022-08-03 01:06:19,328] A new study created in memory with name: Tuning
[I 2022-08-03 01:06:27,839] Trial 0 finished with value: 0.8429293996530143 and parameters: {'randomforestclassifier__max_depth': 5, 'randomforestclassifier__n_estimators': 385}. Best is trial 0 with value: 0.8429293996530143.
[I 2022-08-03 01:06:29,473] Trial 1 finished with value: 0.8382553343965146 and parameters: {'randomforestclassifier__max_depth': 2, 'randomforestclassifier__n_estimators': 81}. Best is trial 0 with value: 0.8429293996530143.
[I 2022-08-03 01:06:33,872] Trial 2 finished with value: 0.8388932846270759 and parameters: {'randomforestclassifier__max_depth': 3, 'randomforestclassifier__n_estimators': 263}. Best is trial 0 with value: 0.8429293996530143.
[I 2022-08-03 01:06:40,068] Trial 3 finished with value: 0.8400316510969397 and parameters: {'randomforestclassifier__max_depth': 4, 'randomforestclassifier__n_estimators': 325}. Best is trial 0 with value: 0.8429293996530143.
[I 2022-08-03 01:06:42,776] Trial 4 finished with value: 0.8419018290676166 and parameters: {'randomforestclassifier__max_depth': 5, 'randomforestclassifier__n_estimators': 109}. Best is trial 0 with value: 0.8429293996530143.
[I 2022-08-03 01:06:44,899] Trial 5 finished with value: 0.838602455901133 and parameters: {'randomforestclassifier__max_depth': 2, 'randomforestclassifier__n_estimators': 93}. Best is trial 0 with value: 0.8429293996530143.
[I 2022-08-03 01:06:52,109] Trial 6 finished with value: 0.8370566340750478 and parameters: {'randomforestclassifier__max_depth': 2, 'randomforestclassifier__n_estimators': 455}. Best is trial 0 with value: 0.8429293996530143.
[I 2022-08-03 01:06:55,831] Trial 7 finished with value: 0.8431237045681887 and parameters: {'randomforestclassifier__max_depth': 5, 'randomforestclassifier__n_estimators': 157}. Best is trial 7 with value: 0.8431237045681887.
[I 2022-08-03 01:07:02,615] Trial 8 finished with value: 0.8428656616183436 and parameters: {'randomforestclassifier__max_depth': 5, 'randomforestclassifier__n_estimators': 300}. Best is trial 7 with value: 0.8431237045681887.
[I 2022-08-03 01:07:05,032] Trial 9 finished with value: 0.8418662617126962 and parameters: {'randomforestclassifier__max_depth': 5, 'randomforestclassifier__n_estimators': 97}. Best is trial 7 with value: 0.8431237045681887.
[I 2022-08-03 01:07:09,051] Trial 10 finished with value: 0.8399832020340174 and parameters: {'randomforestclassifier__max_depth': 4, 'randomforestclassifier__n_estimators': 203}. Best is trial 7 with value: 0.8431237045681887.
[I 2022-08-03 01:07:15,742] Trial 11 finished with value: 0.8403376367558286 and parameters: {'randomforestclassifier__max_depth': 4, 'randomforestclassifier__n_estimators': 413}. Best is trial 7 with value: 0.8431237045681887.
[I 2022-08-03 01:07:23,134] Trial 12 finished with value: 0.8429293996530143 and parameters: {'randomforestclassifier__max_depth': 5, 'randomforestclassifier__n_estimators': 384}. Best is trial 7 with value: 0.8431237045681887.
[I 2022-08-03 01:07:26,556] Trial 13 finished with value: 0.8387392505230238 and parameters: {'randomforestclassifier__max_depth': 3, 'randomforestclassifier__n_estimators': 195}. Best is trial 7 with value: 0.8431237045681887.
[I 2022-08-03 01:07:30,350] Trial 14 finished with value: 0.8402376660597497 and parameters: {'randomforestclassifier__max_depth': 4, 'randomforestclassifier__n_estimators': 185}. Best is trial 7 with value: 0.8431237045681887.
[I 2022-08-03 01:07:39,370] Trial 15 finished with value: 0.8429875463091809 and parameters: {'randomforestclassifier__max_depth': 5, 'randomforestclassifier__n_estimators': 482}. Best is trial 7 with value: 0.8431237045681887.
[I 2022-08-03 01:07:48,499] Trial 16 finished with value: 0.8435038973716166 and parameters: {'randomforestclassifier__max_depth': 5, 'randomforestclassifier__n_estimators': 494}. Best is trial 16 with value: 0.8435038973716166.
[I 2022-08-03 01:07:53,070] Trial 17 finished with value: 0.8388897060785604 and parameters: {'randomforestclassifier__max_depth': 3, 'randomforestclassifier__n_estimators': 254}. Best is trial 16 with value: 0.8435038973716166.
[I 2022-08-03 01:07:56,186] Trial 18 finished with value: 0.8397807697268995 and parameters: {'randomforestclassifier__max_depth': 4, 'randomforestclassifier__n_estimators': 159}. Best is trial 16 with value: 0.8435038973716166.
[I 2022-08-03 01:08:03,302] Trial 19 finished with value: 0.8431261360769495 and parameters: {'randomforestclassifier__max_depth': 5, 'randomforestclassifier__n_estimators': 345}. Best is trial 16 with value: 0.8435038973716166.
Out[ ]:
number value datetime_start datetime_complete duration params_randomforestclassifier__max_depth params_randomforestclassifier__n_estimators state
0 0 0.842929 2022-08-03 01:06:19.336026 2022-08-03 01:06:27.839040 0 days 00:00:08.503014 5 385 COMPLETE
1 1 0.838255 2022-08-03 01:06:27.839040 2022-08-03 01:06:29.473370 0 days 00:00:01.634330 2 81 COMPLETE
2 2 0.838893 2022-08-03 01:06:29.488912 2022-08-03 01:06:33.872028 0 days 00:00:04.383116 3 263 COMPLETE
3 3 0.840032 2022-08-03 01:06:33.881227 2022-08-03 01:06:40.067004 0 days 00:00:06.185777 4 325 COMPLETE
4 4 0.841902 2022-08-03 01:06:40.069000 2022-08-03 01:06:42.775051 0 days 00:00:02.706051 5 109 COMPLETE

Visualize the hyperparameter tuning

In [ ]:
from optuna.visualization import plot_parallel_coordinate

plot_parallel_coordinate(study)
In [ ]:
from optuna.visualization import plot_contour

plot_contour(study)
In [ ]:
from optuna.visualization import plot_slice

plot_slice(study)
In [ ]:
from optuna.visualization import plot_param_importances

plot_param_importances(study)

Once we have an estimate of the best hyperparameters among the ones tested, we can create our tuned Random Forest ("tuned_rf").

In [ ]:
tuned_rf = pipe
tuned_rf.set_params(**study.best_params)  # apply best parameters
# fit the tuned model on the entire train set
tuned_rf.fit(X_train, y_train)
tuned_rf
Out[ ]:
Pipeline(steps=[('randomforestclassifier',
                 RandomForestClassifier(max_depth=5, n_estimators=494,
                                        n_jobs=-1, random_state=42))])

ROC Curves¶

Plot ROC (Receiver Operating Characteristic) curves and compare their AUC (Area Under the Curve)

In [ ]:
from sklearn.metrics import roc_curve, auc

def plot_roc(fpr, tpr, roc_auc, model_name, plot_baseline=True):
    # plot ROC based on false positive rate, true positive rate and display roc_auc
    lw = 2  # line width
    plt.plot(
        fpr,
        tpr,
        lw=lw,
        label=f"AUC ROC {model_name} {round(roc_auc, 2)}",
    )
    if plot_baseline:
        plt.plot([0, 1], [0, 1], color="navy", lw=lw, linestyle="--", label="Baseline")
    plt.xlim([0.0, 1.0])
    plt.ylim([0.0, 1.05])
    plt.xlabel("False Positive Rate")
    plt.ylabel("True Positive Rate")
    plt.title("Receiver Operating Characteristic")
    plt.legend(loc="best")


def plot_roc_multiple(models, X_test, y_test):
    # plot multiple roc curve for multiple models
    plt.figure()
    # Compute ROC curve and ROC area
    is_first_model = True
    for model_name, model in models.items():
        y_pred = model.predict_proba(X_test)  # probabilities
        fpr, tpr, _ = roc_curve(y_test, y_pred[:, 1])
        roc_auc = auc(fpr, tpr)
        plot_roc(fpr, tpr, roc_auc, model_name, plot_baseline=is_first_model)
        is_first_model = False
    plt.show()

models = classifiers.get_fitted_models()  # dictionary of {model_name: fitted_model}
models["tuned_rf"] = tuned_rf  # add tuned model
plot_roc_multiple(models, X_test, y_test)

Evaluate classification metrics for each class

In [ ]:
from sklearn.metrics import classification_report

model = tuned_rf  # let's evaluate our tuned model on the test set
y_pred = model.predict(X_test)
report = classification_report(y_test, y_pred)
print(report)
              precision    recall  f1-score   support

         0.0       0.83      0.88      0.85      1934
         1.0       0.72      0.64      0.68       944

    accuracy                           0.80      2878
   macro avg       0.78      0.76      0.77      2878
weighted avg       0.80      0.80      0.80      2878

Sensitivity analysis¶

For a Random Forest, we can use MDI (Mean Decrease in Impurity, averaged across all trees), among many other techniques, to evaluate the importance of each feature.

In [ ]:
# extract feature importance using MDI with standard deviations

# !! the randomforest is the first element of the pipeline (we didn't apply any preprocessing on our dataset)
# => We access the model by using model[0]. Can be model[k] k>=1 in other cases

importances = model[0].feature_importances_  # feature importance
feature_names = model[0].feature_names_in_  # feature names
forest_importances = pd.DataFrame()
forest_importances["importances"] = importances
forest_importances.index = feature_names
std = np.std([tree.feature_importances_ for tree in model[0].estimators_], axis=0)  # std of feature importance across trees
forest_importances["std"] = std
forest_importances["yerr"] = 1.96 * forest_importances["std"] / np.sqrt(len(model[0].estimators_))  # 95% confidence interval of the mean
forest_importances.sort_values(by="importances", ascending=False, inplace=True)

fig, ax = plt.subplots(figsize=(12, 8))
forest_importances["importances"].plot.bar(yerr=forest_importances["yerr"], ax=ax)
ax.set_title("Feature importances using MDI - 95% confidence interval")
ax.set_xlabel("Features")
ax.set_ylabel("Mean Decrease in Impurity")
plt.xticks(rotation=45)
fig.tight_layout()
  • Previous increases in vaccination and recent media-coverage values of vaccination are by far the most important features.

  • Weekly variations of deaths and cases seem to be irrelevant as features in our model.
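MDI is known to overstate the importance of high-cardinality features and is computed on the training data; permutation importance is a model-agnostic alternative that measures the score drop when each feature is shuffled. A minimal sketch on synthetic data (the features and labels here are made up, not the notebook's dataset):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

rng = np.random.RandomState(0)
# Toy data: only the first of the 3 columns is informative
X = rng.normal(size=(300, 3))
y = (X[:, 0] > 0).astype(int)

rf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

# Shuffle each feature and measure the score drop (model-agnostic importance)
result = permutation_importance(rf, X, y, n_repeats=10, random_state=0)
print(result.importances_mean)  # the first feature dominates
```

On the notebook's pipeline, the same call on `(model, X_test, y_test)` gives importances measured on held-out data.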

We can also plot one tree (an estimator of the random forest) to visualize the decisions that lead to a prediction, in order to help explain our model.

Plot only the first 3 levels of the tree.

In [ ]:
from sklearn import tree

fig, axes = plt.subplots(1, 1, figsize=(8, 8), dpi=300)
decision_tree = model[0].estimators_[0]
tree.plot_tree(decision_tree, max_depth=3, feature_names=feature_names,
               class_names=["Constant", "Increase"], filled=True,
               fontsize=4)
Out[ ]:
[Text(0.5, 0.9, 'increase_vac_l4 <= 0.5\ngini = 0.497\nsamples = 7277\nvalue = [5347, 6164]\nclass = Increase'),
 Text(0.25, 0.7, 'increase_vac_l1 <= 0.5\ngini = 0.39\nsamples = 3430\nvalue = [3963, 1433]\nclass = Constant'),
 Text(0.125, 0.5, 'increase_vac_l2 <= 0.5\ngini = 0.311\nsamples = 2518\nvalue = [3180, 758]\nclass = Constant'),
 Text(0.0625, 0.3, 'media_ratio_l1 <= 0.199\ngini = 0.296\nsamples = 2037\nvalue = [2567, 567]\nclass = Constant'),
 Text(0.03125, 0.1, '\n  (...)  \n'),
 Text(0.09375, 0.1, '\n  (...)  \n'),
 Text(0.1875, 0.3, 'increase_vac <= 0.5\ngini = 0.362\nsamples = 481\nvalue = [613, 191]\nclass = Constant'),
 Text(0.15625, 0.1, '\n  (...)  \n'),
 Text(0.21875, 0.1, '\n  (...)  \n'),
 Text(0.375, 0.5, 'increase_vac <= 0.5\ngini = 0.497\nsamples = 912\nvalue = [783, 675]\nclass = Constant'),
 Text(0.3125, 0.3, 'increase_vac_l2 <= 0.5\ngini = 0.366\nsamples = 495\nvalue = [604, 192]\nclass = Constant'),
 Text(0.28125, 0.1, '\n  (...)  \n'),
 Text(0.34375, 0.1, '\n  (...)  \n'),
 Text(0.4375, 0.3, 'media_ratio_l2 <= 0.186\ngini = 0.395\nsamples = 417\nvalue = [179, 483]\nclass = Increase'),
 Text(0.40625, 0.1, '\n  (...)  \n'),
 Text(0.46875, 0.1, '\n  (...)  \n'),
 Text(0.75, 0.7, 'increase_vac_l5 <= 0.5\ngini = 0.35\nsamples = 3847\nvalue = [1384, 4731]\nclass = Increase'),
 Text(0.625, 0.5, 'increase_vac <= 0.5\ngini = 0.475\nsamples = 802\nvalue = [775, 493]\nclass = Constant'),
 Text(0.5625, 0.3, 'increase_vac_l3 <= 0.5\ngini = 0.314\nsamples = 456\nvalue = [569, 138]\nclass = Constant'),
 Text(0.53125, 0.1, '\n  (...)  \n'),
 Text(0.59375, 0.1, '\n  (...)  \n'),
 Text(0.6875, 0.3, 'media_ratio_l1 <= 0.133\ngini = 0.465\nsamples = 346\nvalue = [206, 355]\nclass = Increase'),
 Text(0.65625, 0.1, '\n  (...)  \n'),
 Text(0.71875, 0.1, '\n  (...)  \n'),
 Text(0.875, 0.5, 'increase_vac_l2 <= 0.5\ngini = 0.22\nsamples = 3045\nvalue = [609, 4238]\nclass = Increase'),
 Text(0.8125, 0.3, 'media_ratio_l1 <= 0.119\ngini = 0.422\nsamples = 382\nvalue = [185, 427]\nclass = Increase'),
 Text(0.78125, 0.1, '\n  (...)  \n'),
 Text(0.84375, 0.1, '\n  (...)  \n'),
 Text(0.9375, 0.3, 'increase_vac <= 0.5\ngini = 0.18\nsamples = 2663\nvalue = [424, 3811]\nclass = Increase'),
 Text(0.90625, 0.1, '\n  (...)  \n'),
 Text(0.96875, 0.1, '\n  (...)  \n')]

Gini is low on the left/right extremities of the tree.

  • It seems that an increase in vaccination is likely to happen if it happened in the days before.

  • Vaccination coverage was more likely to stay constant when it hadn't increased in the past few days and when the nationwide "COVID-19 Vaccination" media coverage was low.

Checking for bias¶

Using the code below, we can evaluate our model per country.

In [ ]:
from sklearn.metrics import balanced_accuracy_score

_, X_test_cty, _, _ = train_test_split(X, y, test_size=0.20, shuffle=False)  # keep country_name in X this time
X_test_cty = X_test_cty.copy()  # work on a copy to avoid a SettingWithCopyWarning
X_test_cty["label"] = y_test
country_list = []
acc_list = []
for country in X_test_cty["country_name"].unique():
    sub_df = X_test_cty[X_test_cty["country_name"] == country]  # filter attribute values and labels for this country in the test set
    y_test_2 = sub_df["label"]  # extract labels
    X_test_2 = sub_df.drop(columns=["country_name", "label"])  # remove these variables (not predictors)
    pred = model.predict(X_test_2)  # make predictions
    acc = balanced_accuracy_score(y_test_2, pred)  # calculate balanced accuracy
    country_list.append(country)
    acc_list.append(acc)
accuracy_per_country = pd.DataFrame()
accuracy_per_country["Country Name"] = country_list
accuracy_per_country["Balanced Accuracy"] = acc_list
accuracy_per_country.sort_values(by="Balanced Accuracy", ascending=False, inplace=True)  # sort the countries based on their accuracies
accuracy_per_country
Out[ ]:
Country Name Balanced Accuracy
24 Venezuela 1.000000
30 Uruguay 0.797354
19 Trinidad and Tobago 0.743172
31 Ecuador 0.702963
23 Chile 0.673963
2 Colombia 0.635714
11 Mexico 0.630946
10 Cuba 0.610836
25 Argentina 0.610294
1 Brazil 0.555556
0 Guyana 0.543736
20 Saint Vincent and the Grenadines 0.541667
18 Bolivia 0.538547
9 Paraguay 0.500000
22 Jamaica 0.500000
3 Bahamas 0.500000
28 Grenada 0.500000
26 Dominica 0.500000
4 Belize 0.500000
5 Guatemala 0.500000
6 Antigua and Barbuda 0.500000
21 Honduras 0.500000
12 Saint Lucia 0.500000
7 Suriname 0.500000
17 Costa Rica 0.500000
8 Haiti 0.500000
15 Peru 0.500000
14 El Salvador 0.500000
16 Nicaragua 0.500000
29 Panama 0.493243
27 Dominican Republic 0.488889
13 Barbados 0.486345

Bonus: Data Visualization¶

Let's plot the inequalities in terms of Balanced Accuracy of our model using a choropleth map.

In [ ]:
import plotly.express as px

# Load geojson file with coordinates in South America
from urllib.request import urlopen
import json
with urlopen('https://raw.githubusercontent.com/codeforgermany/click_that_hood/main/public/data/south-america.geojson') as response:
    geojson = json.load(response)
for feature in geojson["features"]:
    feature["id"] = feature["properties"]["name"]

fig = px.choropleth(accuracy_per_country, geojson=geojson, locations="Country Name", color="Balanced Accuracy",
                    color_continuous_scale="reds",
                    labels={"Balanced Accuracy": "Balanced Accuracy"}
                    )
fig.update_geos(fitbounds="locations")
# fig.update_layout(height=300, margin={"r":0,"t":0,"l":0,"b":0})
fig.update_layout(title="Choropleth map: Balanced accuracy of our model on the test set (per country)")
fig.show()

Going further ...¶

Reformulate the problem/the strategy:

  • Does it make more sense / is it more useful to predict three classes: 0 - no increase, 1 - "short" increase, 2 - "large" increase in vaccination?

  • Make one model per country / add a group parameter to stratify by country the folds during the training
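For the grouping idea, scikit-learn's `GroupKFold` keeps all rows of a country inside the same fold, so a country never appears in both the train and validation split. A minimal sketch with hypothetical country codes (the data below is made up):

```python
import numpy as np
from sklearn.model_selection import GroupKFold

X = np.arange(12).reshape(-1, 1)
# Hypothetical country label for each row
groups = np.array(["AR", "AR", "AR", "BR", "BR", "BR",
                   "CL", "CL", "CL", "MX", "MX", "MX"])
y = np.zeros(12)

gkf = GroupKFold(n_splits=4)
for train_idx, test_idx in gkf.split(X, y, groups=groups):
    # No country appears in both the train and test fold
    assert set(groups[train_idx]).isdisjoint(groups[test_idx])
```

In the notebook's setting, `groups` would be the `country_name` column, passed to `cross_val_score(..., groups=...)` or to the splitter directly.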

Add more features:

  • Create artificial features (product of two features, use exp, log, sqrt functions, etc)

  • Incorporate Google Trends data, Media Cloud data with other keywords, country specific sociodemographics data, etc
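Artificial features of the kind listed above are one-liners in pandas. A sketch on a hypothetical frame (the values are made up; the column names only mimic the notebook's predictors):

```python
import numpy as np
import pandas as pd

# Hypothetical frame mimicking two of the notebook's predictors
df = pd.DataFrame({"media_ratio_l1": [0.1, 0.2, 0.4],
                   "increase_vac_l1": [0.0, 1.0, 1.0]})

# Interaction term and simple non-linear transforms
df["media_x_vac"] = df["media_ratio_l1"] * df["increase_vac_l1"]
df["log_media"] = np.log1p(df["media_ratio_l1"])   # log1p is safe at 0
df["sqrt_media"] = np.sqrt(df["media_ratio_l1"])
print(df.round(3))
```

Remember to validate each new feature: keep it only if it improves cross-validated performance.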

More Exploratory Data Analysis (EDA):

  • Look at the distribution of the features

  • Correlation matrix

  • More plots to detect patterns/anomalies

Tip: use tools like pandas-profiling

Preprocessing:

  • Normalization (on non-categorical data especially)

  • Feature selection (based on feature importance, correlation, etc)

  • ...
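Normalization fits naturally as an extra step of the pipeline, which preserves the `model[k]` indexing used earlier for feature importance. A minimal sketch with toy, made-up data (`StandardScaler` + `LogisticRegression` here stand in for whatever preprocessing and model you choose):

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# Toy data with very different feature scales
X = np.array([[1.0, 200.0], [2.0, 400.0], [3.0, 600.0], [4.0, 800.0]])
y = np.array([0, 0, 1, 1])

# The scaler is fit only on training data at each fit() call,
# which avoids leaking test-set statistics into preprocessing
pipe = Pipeline([("scale", StandardScaler()),
                 ("clf", LogisticRegression())]).fit(X, y)
print(pipe.predict(X))
```

With this layout the classifier sits at `pipe[1]`, which is why the earlier comment notes that the random forest "can be model[k] with k >= 1 in other cases".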

Models:

  • Explore other models

  • Further tune the hyperparameters

  • Search for other sources of biases

Don't forget to evaluate, as much as possible, the impact of your preprocessing pipeline, your additional features, etc. on performance.

Conclusion¶

More importantly

Be creative and have fun! <3 <3

Notebook on https://github.com/AdrienC21/makehealthlatam_introductorytraining_2022